Introduction

Flooding is the most common natural disaster for cities living near water. With increasing frequency and severity due to factors such as climate change and urbanization, there is a growing need for accurate flood inundation probability maps to help communities prepare and mitigate impacts.

The project aims to estimate a predictive flood inundation model that predicts the probability of an area being flooded. The model will be trained and validated using data from Calgary, Alberta (Canada) and used to predict flooding probability in Denver, a city with similar hydrologic and geographic features. This project will utilize both ArcGIS and R software to develop and analyze the flood inundation model. ArcGIS will be used to process and analyze spatial data, while R will be used for statistical analysis and machine learning algorithms. The resulting flood inundation probability maps will provide information of likelihood of flooding of Denver and provide a reference of flood inundation prediction model for other cities.

Setup

Loading libraries and jargons

To get started, we need to install and call the libraries we need. We will use a set of ggplot styles we call mapTheme and plotTheme.

library(plotROC)
library(tidyverse)
library(sf)
library(ggplot2)
library(spdep)
library(caTools)
library(plotROC)
library(caret)
library(pROC)
library(viridis)
library(gridExtra)
library(cowplot)
library(patchwork)
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())

Data wrangling

Data preprocessing

ArcGIS Pro was employed to generate predictive features such as slope, distance to streams with drainage greater than 50km2 and 100km2, flow accumulation, and imperious percentage of surface from terrain and impervious raster data. Furthermore, the predicted variable was obtained from Calgary open data, specifically the percentage area within the 10-year (10%) inundated zone, which was subsequently employed as the target variable. A similar process was employed in generating predictive features for Denver, with the exception that inundation data was not available

predictors <- c(
  "boundary", # ---------- 1: inside city boundary; 0: outside city boundary
  "dem", # --------------- From DEM; meters
  "slope", # ------------- Percentage rise
  "dist_big_streams", # -- Distance (meters) to streams with drainage > 50 km2
  "dist_huge_streams", # - Distance (meters) to streams with drainage > 100 km2
  "flow_accumulation", # - Flow accumulation (number of cells)
  "impervious" # --------- Impervious surface as percange of area
)

targets <- c(
  "inundation_1pct", # --- Percange area in 100-year (1%) inundated zone
  "inundation_10pct" # --- Percange area in 10-year (10%) inundated zone
)

The next step involved generating fishnet data for both cities, which was then used to aggregate the raster feature data. Zonal statistics were applied to compute the average values of raster features, with a 30 by 30-meter cell size. Next, a self-defined function was used to loop through and integrate the features into separate geo-dataframes for Calgary and Denver, respectively.

add_variables_from_csv <- function(city_fishnet, city_name, variable_list) {
  for (variable in variable_list) {
    path <- paste0(
      "~/Documents/GitHub/flood-inundation-map/data/fishnet-output/tbl_", city_name, "_", variable, ".csv"
    )
    
    data <- read.csv(path) %>%
      rename(id := OID_, !!variable := value)
    
    city_fishnet <- city_fishnet %>%
      left_join(data, by = "id")
  }
  
  city_fishnet <- city_fishnet %>%
    filter(boundary > 0.9) %>%
    dplyr::select(-boundary)
  
  return(city_fishnet)
}
calgary <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/calgary_fishnet.shp") %>%
  dplyr::select(id, geometry) %>%
  add_variables_from_csv("calgary", c(predictors, targets))

denver <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/denver_fishnet.shp") %>%
  dplyr::select(id, geometry) %>%
  add_variables_from_csv("denver", predictors)

Exploratory analysis

During the data exploration and manipulation stage, histograms were plotted to examine the distribution of the features. It was observed that the distributions of some features such as dist_big_streams, dist_huge_streams, and flow_accumulation were skewed, posing problems for analysis. To address this, a logarithmic transformation was applied to these features to normalize their distribution.

Below are the histograms of Calgary dataset.

## Function to plot histograms of the variables

data_df <- as.data.frame(calgary)

plot_histograms <- function(data, predictors,id_col = "id") {
  data_df <- as.data.frame(data)
  data_numeric <- data_df %>% dplyr::select(-!!sym(id_col)) %>% dplyr::select_if(is.numeric)
  
  # Initialize an empty list to store the plots
  plots <- list()
  
  for (predictor in predictors) {
    # Create a histogram for each predictor in the dataset
    p <- ggplot(data, aes_string(predictor)) +
      geom_histogram(fill = "grey") + # Removed binwidth parameter
      labs(title = paste("Histogram of", "\n",predictor),
           x = predictor,
           y = "Frequency") +
      theme_minimal()+
      theme(plot.title = element_text(size = 12))
      
    # Add the histograms to the list
    plots <- append(plots, list(p))
  }
  
  # Arrange the plots in a grid with 3 columns and 2 rows
  grid.arrange(grobs = plots, ncol = 3, nrow = 2)
}

# Call the function
plot_histograms(data=calgary, predictors = predictors2)

Below are the histograms of Denver dataset.

plot_histograms(denver, predictors2)

# Function to engineer features

# for calgary
engineer_features_calgary <- function(data) {
  data <- data %>%
    mutate(
      log_dist_big_streams = log(dist_big_streams),
      log_dist_huge_streams = log(dist_huge_streams),
      log_flow_accumulation = log(flow_accumulation),
      inundated = ifelse(inundation_10pct > 0.5, 1, 0)
    )
  return(data)
}

# for denver
engineer_features_denver <- function(data) {
  data <- data %>%
    mutate(
      log_dist_big_streams = log(dist_big_streams),
      log_dist_huge_streams = log(dist_huge_streams),
      log_flow_accumulation = log(flow_accumulation),
      #inundated = ifelse(inundation_1pct > 0.5, 1, 0)
    )
  return(data)
}

As for the target variable inundated, which was binarized from the original variable inundation_10pct, percentage area in 10-year (10%) inundated zone. The inundated variable was assigned a value using an ifelse statement. If the inundation_1pct variable was greater than 0.5, then the inundated variable was assigned a value of 1, which is positive in inundation. Otherwise, the inundated variable was assigned a value of 0.

# Get the predictors and target

predictors_used <- c(
  "dem", # --------------- From DEM; meters
  "slope", # ------------- Percentage rise
  "log_dist_big_streams",
  "log_dist_huge_streams",
  "log_flow_accumulation",
  "impervious" # --
)

lag_predictors_used <- predictors_used %>%
  sapply(., function(x) {
    paste0("lag_", x)
  }) %>%
  unname()


all_predictors_used <- c(predictors_used, lag_predictors_used)

target_used <- c("inundated")

calgary_used <- engineer_features_calgary(calgary) %>%
  dplyr::select(all_of(c("id", predictors_used, target_used)))

denver_used <- engineer_features_denver(denver) %>%
  dplyr::select(all_of(c("id", predictors_used)))

In the context of spatial analysis, it is important to account for spatial autocorrelation, which is the tendency for the values of a variable to be more similar at nearby locations than at distant ones. The calculate_spatial_lags function is used to address this issue by generating spatial lag variables, which are the weighted averages of the values of a variable in the neighboring locations. Incorporating spatial lags into our model helps capture the spatial patterns and dependencies within the data, leading to more accurate and reliable predictions.

# add predictor spatial lags

calculate_spatial_lags <-
  function(fishnet, predictors, id_col = "id", geometry_col = "geometry") {
    # Create neighbors list using the 'geometry' column
    nb <- poly2nb(fishnet, row.names = fishnet[[id_col]])
    
    # Create spatial weights matrix
    swm <- nb2listw(nb, style = "W", zero.policy = TRUE)
    
    # Calculate spatial lags for the specified predictor variables
    for (predictor in predictors) {
      spatial_lag_colname <- paste0("lag_", predictor)
      predictor_values <- as.numeric(fishnet[[predictor]])
      fishnet[[spatial_lag_colname]] <- lag.listw(swm, predictor_values, zero.policy = TRUE)
    }
    
    return(fishnet)
  }

calgary_used <- calgary_used %>%
  calculate_spatial_lags(predictors_used)

denver_used <- denver_used %>%
  calculate_spatial_lags(predictors_used)

Model building

Partition training and test sets

To build the model, relevant features and the target variable were selected, and then the data was randomly split into training (75%) and testing (25%) sets.

calgary_for_model <- calgary_used %>%
  dplyr::select(all_of(c(all_predictors_used, target_used)), id)

train_ratio <- 0.75
sample <- sample.split(calgary_for_model$inundated, SplitRatio = train_ratio)
train <- subset(calgary_for_model, sample == TRUE)
test <- subset(calgary_for_model, sample == FALSE)

Train a logistic model

The training set was used to train a logistic model using the glm() function, which specified the target variable inundatedand all other predictors. The predict function was then used to generate predicted probabilities for the test dataset, which was stored in the predicted_probs column.

model1 <- glm(
  inundated ~ ., train %>% dplyr::select(-id) %>% st_drop_geometry(),
  family = "binomial"(link = "logit")
)

test$predicted_probs <- predict(model1, test, type = "response")

Model summary

The summary output of the training logistic model provides the results of the logistic regression model. The summary shows which features are statistically significant in predicting the target variable in Calgary. In this case, the features dem, slope, log_flow_accumulation, lag_dem, lag_log_dist_big_streams, lag_log_dist_huge_streams, and lag_impervious were found to be statistically significant in predicting the target variable. This means that these features have a significant impact on the likelihood of an area being inundated.

summary(model1)
## 
## Call:
## glm(formula = inundated ~ ., family = binomial(link = "logit"), 
##     data = train %>% dplyr::select(-id) %>% st_drop_geometry())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3118  -0.0315  -0.0080  -0.0016   3.3563  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               54.07350    7.63623   7.081 1.43e-12 ***
## dem                       -0.23006    0.06559  -3.508 0.000452 ***
## slope                     -0.10391    0.05762  -1.803 0.071335 .  
## log_dist_big_streams      -0.37555    1.71540  -0.219 0.826704    
## log_dist_huge_streams     -0.16387    1.72417  -0.095 0.924280    
## log_flow_accumulation      0.21305    0.06779   3.143 0.001674 ** 
## impervious                 1.45692    0.83760   1.739 0.081965 .  
## lag_dem                    0.17787    0.06521   2.728 0.006380 ** 
## lag_slope                 -0.04539    0.11510  -0.394 0.693352    
## lag_log_dist_big_streams   1.77704    2.45786   0.723 0.469677    
## lag_log_dist_huge_streams -2.67285    2.45601  -1.088 0.276466    
## lag_log_flow_accumulation  0.41361    0.15708   2.633 0.008461 ** 
## lag_impervious             2.40954    1.24950   1.928 0.053805 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1145.71  on 6438  degrees of freedom
## Residual deviance:  445.31  on 6426  degrees of freedom
## AIC: 471.31
## 
## Number of Fisher Scoring iterations: 11

After identifying the statistically significant features, maps of those features were plot to visualize the spatial distribution of Calgary.

# Create individual plots
plot1 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=dem)) +
  scale_fill_viridis() +
  labs(title="DEM in meters") + mapTheme

plot2 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=slope)) +
  scale_fill_viridis() +
  labs(title="Slope,percentage rise") + mapTheme

combined_plot <- plot_grid(plot1, plot2,ncol=2) 
print(combined_plot)      

plot3 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=log_flow_accumulation)) +
  scale_fill_viridis(name="") +
  labs(title="Log flow accumulation") + 
  mapTheme

plot4 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_big_streams)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag log", "\nDistance (meters) to streams with drainage > 50 km2") +
  mapTheme

combined_plot <- plot_grid(plot3,plot4,ncol=2)
print(combined_plot)  

plot5 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_huge_streams)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag log", "\nDistance (meters) to streams with drainage > 100 km2") + mapTheme

plot6 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_impervious)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag","\nImpervious surface as percange of area") + mapTheme

combined_plot <- plot_grid(plot5, plot6,ncol=2)
print(combined_plot)  

Model evaluation

Probability density plots

The summary output only gives the overall AIC model performance and statistically significant features. However, the summary could not evaluate the error and accuracy.

ggplot(test, aes(x = predicted_probs, fill = as.factor(inundated))) +
  geom_density() +
  facet_grid(inundated ~ ., scales = "free") +
  xlim(-0.1, 1) +
  labs(
    x = "Predicted Probability of Inundation",
    y = "Probability Density",
    title = "Distribution of predicted probabilities by observed outcome"
  )+
  scale_fill_manual(values = c("dark blue", "dark red"),
                    labels = c("Not Inundated","Inundated"),
                    name = "")+
  geom_vline(xintercept = .5)

To take a closer look, probability density plots were generated to show the distribution of predicted probabilities of inundation for the test set. The plots are arranged in a grid, with one plot for each value of the inundated variable. The blue plot represents areas that are not inundated, while the red plot represents areas that are inundated. The vertical line represents a 0.5 probability of inundation. If predicted probability is below 0.5, the predicted class is 0 (not inundated) and vice verses.

From the plot, we can observe that the False Negative rate is rather high. This indicates that there are cases where the actual class is 1 (inundated), but the model incorrectly assigns a predicted class of 0 (not inundated) due to the predicted probability being below 0.5. This suggests that the model may not be accurately identifying areas that are likely to be inundated.

Although the False Negative rate is high, the plot reveals a high True Positive rate. This means that the model is correctly predicting areas that are likely to be inundated, with a high probability. The high True Positive rate is encouraging, as it indicates that the model has some predictive power and can be useful in identifying areas that are vulnerable to flooding. However, it is still important to address the False Negative rate to improve the model’s accuracy and reliability.

ROC and AUC plots

The plot below shows the ROC curve and AUC for the model. The AUC value of 0.99 indicates that the model has a very high degree of accuracy in predicting whether an area will be inundated or not. However, it has potential to be overfitting with new datasets.

roc_data <- data.frame(
  D = as.logical(test$inundated),
  M = test$predicted_probs
)

ggplot(roc_data, aes(d = D, m = M)) +
  geom_roc() +
  geom_abline(slope = 1, intercept = 0, linewidth = 1.5, color = "grey") +
  labs(
    title = "ROC Curve",
    subtitle = paste("Area Under Curve (AUC):", round(auc(pROC::roc(test$inundated, test$predicted_probs)), 4)),
    x = "False Positive Rate (FPR)",
    y = "True Positive Rate (TPR)"
  ) 

Map prediction

Train on entire Calgary

To generated predicted inundation, the entire Calgray datasetcalgary_for_model was used to create a logistic regression model. where the target variable is inundated, and all other variables are predictors. After training the model, the predict function is used to generate predicted probabilities for the entire Calgary dataset. The predicted probabilities are stored in the calgary_for_model$predicted_probs column.

model2 <- glm(
  inundated ~ ., calgary_for_model %>% dplyr::select(-id) %>% st_drop_geometry(),
  family = "binomial"(link = "logit")
)

summary(model2)
## 
## Call:
## glm(formula = inundated ~ ., family = binomial(link = "logit"), 
##     data = calgary_for_model %>% dplyr::select(-id) %>% st_drop_geometry())
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6114  -0.0325  -0.0092  -0.0021   3.2621  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               49.06877    6.35052   7.727 1.10e-14 ***
## dem                       -0.26483    0.05540  -4.780 1.75e-06 ***
## slope                     -0.10228    0.04887  -2.093  0.03635 *  
## log_dist_big_streams      -0.05956    1.48488  -0.040  0.96800    
## log_dist_huge_streams     -0.06456    1.50139  -0.043  0.96570    
## log_flow_accumulation      0.25224    0.06038   4.177 2.95e-05 ***
## impervious                 2.31083    0.73405   3.148  0.00164 ** 
## lag_dem                    0.21849    0.05495   3.976 7.00e-05 ***
## lag_slope                 -0.07637    0.09697  -0.788  0.43096    
## lag_log_dist_big_streams   1.02312    2.12614   0.481  0.63037    
## lag_log_dist_huge_streams -2.38746    2.13018  -1.121  0.26238    
## lag_log_flow_accumulation  0.32242    0.13171   2.448  0.01437 *  
## lag_impervious             0.50835    1.05915   0.480  0.63125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1527.60  on 8584  degrees of freedom
## Residual deviance:  605.37  on 8572  degrees of freedom
## AIC: 631.37
## 
## Number of Fisher Scoring iterations: 11
calgary_for_model$predicted_probs <- predict(model2, calgary_for_model, type = "response")

Confusion matrix map for Calgary

To map the predicted inundation with error and accuracy, a confusion matrix map was plotted. The mutate function is used to create a new column called confResult, which categorizes the predicted and actual classes into four possible outcomes: True Negative, True Positive, False Negative, and False Positive.

calgary_for_model %>%
  mutate(confResult=case_when(predicted_probs < 0.5 & inundated==0 ~ "True_Negative",
                              predicted_probs >= 0.5 & inundated==1 ~ "True_Positive",
                              predicted_probs < 0.5 & inundated==1 ~ "False_Negative",
                              predicted_probs >= 0.5 & inundated==0 ~ "False_Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("Red","Orange","Light Blue","Light Green"),
                    name="Outcomes")+
  labs(title="Confusion Metrics") +
  mapTheme

Predicted inundation map for Calgary

Below is the predicted inundation map for Calgary trained using entire Calgary dataset.

ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=predicted_probs)) + 
  scale_fill_gradient(low = "blue", high = "red", name = "Predicted probability") +
  labs(title="Predicted inundation") + mapTheme

Predicted inundation map Denver

Then the entire Calgary trained trained logistic regression model was used to predict inundation probabilities for the Denver dataset. The predict function is used to generate predicted probabilities for the Denver dataset using the model object created previously. The predicted probabilities are then stored in the predicted_probs column of the denver_used dataset and mapped using ggplot. This step allows us to apply the model to new data and evaluate how well it generalizes to other areas.

Though the predicted probabilities for Denver were comparatively small, the predicted class were distinguishable in

denver_used$predicted_probs <- predict(model2, denver_used, type = "response")

ggplot() +
  geom_sf(data=denver_used, aes(fill=predicted_probs)) + 
  scale_fill_gradient(low = "blue", high = "red",name = "Predicted probability") +
  labs(title="Predicted inundation") + mapTheme

Summary

This project aimed to estimate flood inundation probabilities using a predictive model for the cities of Calgary and Denver. Using logistic regression models, the prediction performance was evaluated using a confusion matrix and an ROC curve, which indicated an AUC of 0.99 and overall good performance.

However, the predicted probabilities for Denver were comparatively small, indicating a limitation in the model’s ability to generalize to new areas. Furthermore, the AUC is unusually high which might indicate potential overfitting problems. For future improvement, one possible consideration is to validate the model using additional cities, which could make the model more rigorous and accurate in predicting flood inundation probabilities for new locations.

---
title: "Flood Inundation Modeling"
author: "Yingtong_Zhong"
date: "2023-03-28"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction
Flooding is the most common natural disaster for cities living near water. With increasing frequency and severity due to factors such as climate change and urbanization, there is a growing need for accurate flood inundation probability maps to help communities prepare and mitigate impacts.

The project aims to estimate a predictive flood inundation model that predicts the probability of an area being flooded. The model will be trained and validated using data from Calgary, Alberta (Canada) and used to predict flooding probability in Denver, a city with similar hydrologic and geographic features. This project will utilize both ArcGIS and R software to develop and analyze the flood inundation model. ArcGIS will be used to process and analyze spatial data, while R will be used for statistical analysis and machine learning algorithms. The resulting flood inundation probability maps will provide information of likelihood of flooding of Denver and provide a reference of flood inundation prediction model for other cities.


# Setup

## Loading libraries and jargons
To get started, we need to install and call the libraries we need. We will use a set of ggplot styles we call `mapTheme` and `plotTheme`.



```{r libraries, warning = FALSE, message = FALSE}
library(plotROC)
library(tidyverse)
library(sf)
library(ggplot2)
library(spdep)
library(caTools)
library(plotROC)
library(caret)
library(pROC)
library(viridis)
library(gridExtra)
library(cowplot)
library(patchwork)
```

```{r setwd, echo=FALSE}
knitr::opts_chunk$set(root.dir = "~/Documents/GitHub/flood-inundation-map")
```

```{r mapTheme, warning = FALSE, echo=TRUE}
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.75),
  axis.ticks=element_blank())
```

# Data wrangling
## Data preprocessing

ArcGIS Pro was employed to generate predictive features such as slope, distance to streams with drainage greater than 50km2 and 100km2, flow accumulation, and imperious percentage of surface from terrain and impervious raster data. Furthermore, the predicted variable was obtained from Calgary open data, specifically the percentage area within the 10-year (10%) inundated zone, which was subsequently employed as the target variable. A similar process was employed in generating predictive features for Denver, with the exception that inundation data was not available

```{r features and target variables}
predictors <- c(
  "boundary", # ---------- 1: inside city boundary; 0: outside city boundary
  "dem", # --------------- From DEM; meters
  "slope", # ------------- Percentage rise
  "dist_big_streams", # -- Distance (meters) to streams with drainage > 50 km2
  "dist_huge_streams", # - Distance (meters) to streams with drainage > 100 km2
  "flow_accumulation", # - Flow accumulation (number of cells)
  "impervious" # --------- Impervious surface as percange of area
)

targets <- c(
  "inundation_1pct", # --- Percange area in 100-year (1%) inundated zone
  "inundation_10pct" # --- Percange area in 10-year (10%) inundated zone
)
```

The next step involved generating fishnet data for both cities, which was then used to aggregate the raster feature data. Zonal statistics were applied to compute the average values of raster features, with a 30 by 30-meter cell size. Next, a self-defined function was used to loop through and integrate the features into separate geo-dataframes for Calgary and Denver, respectively.

```{r add_data, message= FALSE, warning = FALSE}
add_variables_from_csv <- function(city_fishnet, city_name, variable_list) {
  for (variable in variable_list) {
    path <- paste0(
      "~/Documents/GitHub/flood-inundation-map/data/fishnet-output/tbl_", city_name, "_", variable, ".csv"
    )
    
    data <- read.csv(path) %>%
      rename(id := OID_, !!variable := value)
    
    city_fishnet <- city_fishnet %>%
      left_join(data, by = "id")
  }
  
  city_fishnet <- city_fishnet %>%
    filter(boundary > 0.9) %>%
    dplyr::select(-boundary)
  
  return(city_fishnet)
}
```

```{r read_data, warning = FALSE, message = FALSE,results = "hide"}
calgary <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/calgary_fishnet.shp") %>%
  dplyr::select(id, geometry) %>%
  add_variables_from_csv("calgary", c(predictors, targets))

denver <- st_read("~/Documents/GitHub/flood-inundation-map/data/fishnet-output/denver_fishnet.shp") %>%
  dplyr::select(id, geometry) %>%
  add_variables_from_csv("denver", predictors)
```

## Exploratory analysis

During the data exploration and manipulation stage, histograms were plotted to examine the distribution of the features. It was observed that the distributions of some features such as `dist_big_streams`, `dist_huge_streams`, and `flow_accumulation` were skewed, posing problems for analysis. To address this, a logarithmic transformation was applied to these features to normalize their distribution.

```{r predictors_update, echo=FALSE}
predictors2 <- c(
  "dem", # --------------- From DEM; meters
  "slope", # ------------- Percentage rise
  "dist_big_streams", # -- Distance (meters) to streams with drainage > 50 km2
  "dist_huge_streams", # - Distance (meters) to streams with drainage > 100 km2
  "flow_accumulation", # - Flow accumulation (number of cells)
  "impervious" # --------- Impervious surface as percange of area
)
```

Below are the histograms of `Calgary` dataset.
```{r plot_hist, warning = FALSE, message = FALSE}
## Function to plot histograms of the variables

data_df <- as.data.frame(calgary)

plot_histograms <- function(data, predictors,id_col = "id") {
  data_df <- as.data.frame(data)
  data_numeric <- data_df %>% dplyr::select(-!!sym(id_col)) %>% dplyr::select_if(is.numeric)
  
  # Initialize an empty list to store the plots
  plots <- list()
  
  for (predictor in predictors) {
    # Create a histogram for each predictor in the dataset
    p <- ggplot(data, aes_string(predictor)) +
      geom_histogram(fill = "grey") + # Removed binwidth parameter
      labs(title = paste("Histogram of", "\n",predictor),
           x = predictor,
           y = "Frequency") +
      theme_minimal()+
      theme(plot.title = element_text(size = 12))
      
    # Add the histograms to the list
    plots <- append(plots, list(p))
  }
  
  # Arrange the plots in a grid with 3 columns and 2 rows
  grid.arrange(grobs = plots, ncol = 3, nrow = 2)
}

# Call the function
plot_histograms(data=calgary, predictors = predictors2)
```

Below are the histograms of `Denver` dataset.
```{r plot_hist2, warning = FALSE, message = FALSE}
plot_histograms(denver, predictors2)
```

```{r engineer_features, warning = FALSE, message = FALSE}
# Function to engineer features

# for calgary
engineer_features_calgary <- function(data) {
  data <- data %>%
    mutate(
      log_dist_big_streams = log(dist_big_streams),
      log_dist_huge_streams = log(dist_huge_streams),
      log_flow_accumulation = log(flow_accumulation),
      inundated = ifelse(inundation_10pct > 0.5, 1, 0)
    )
  return(data)
}

# for denver
engineer_features_denver <- function(data) {
  data <- data %>%
    mutate(
      log_dist_big_streams = log(dist_big_streams),
      log_dist_huge_streams = log(dist_huge_streams),
      log_flow_accumulation = log(flow_accumulation),
      #inundated = ifelse(inundation_1pct > 0.5, 1, 0)
    )
  return(data)
}
```

As for the target variable `inundated`, which was binarized from the original variable `inundation_10pct`, percentage area in 10-year (10%) inundated zone.  The inundated variable was assigned a value using an ifelse statement. If the inundation_1pct variable was greater than 0.5, then the inundated variable was assigned a value of `1`, which is positive in inundation. Otherwise, the inundated variable was assigned a value of `0`.

```{r data, warning = FALSE, message = FALSE}
# Get the predictors and target

predictors_used <- c(
  "dem", # --------------- From DEM; meters
  "slope", # ------------- Percentage rise
  "log_dist_big_streams",
  "log_dist_huge_streams",
  "log_flow_accumulation",
  "impervious" # --
)

lag_predictors_used <- predictors_used %>%
  sapply(., function(x) {
    paste0("lag_", x)
  }) %>%
  unname()


all_predictors_used <- c(predictors_used, lag_predictors_used)

target_used <- c("inundated")

calgary_used <- engineer_features_calgary(calgary) %>%
  dplyr::select(all_of(c("id", predictors_used, target_used)))

denver_used <- engineer_features_denver(denver) %>%
  dplyr::select(all_of(c("id", predictors_used)))
```

In the context of spatial analysis, it is important to account for spatial autocorrelation, which is the tendency for the values of a variable to be more similar at nearby locations than at distant ones. The `calculate_spatial_lags` function is used to address this issue by generating spatial lag variables, which are the weighted averages of the values of a variable in the neighboring locations. Incorporating spatial lags into our model helps capture the spatial patterns and dependencies within the data, leading to more accurate and reliable predictions.


```{r spatial_lags, warning = FALSE, message = FALSE}
# add predictor spatial lags

calculate_spatial_lags <-
  function(fishnet, predictors, id_col = "id", geometry_col = "geometry") {
    # Create neighbors list using the 'geometry' column
    nb <- poly2nb(fishnet, row.names = fishnet[[id_col]])
    
    # Create spatial weights matrix
    swm <- nb2listw(nb, style = "W", zero.policy = TRUE)
    
    # Calculate spatial lags for the specified predictor variables
    for (predictor in predictors) {
      spatial_lag_colname <- paste0("lag_", predictor)
      predictor_values <- as.numeric(fishnet[[predictor]])
      fishnet[[spatial_lag_colname]] <- lag.listw(swm, predictor_values, zero.policy = TRUE)
    }
    
    return(fishnet)
  }

calgary_used <- calgary_used %>%
  calculate_spatial_lags(predictors_used)

denver_used <- denver_used %>%
  calculate_spatial_lags(predictors_used)
```

# Model building

## Partition training and test sets 

To build the model, relevant features and the target variable were selected, and then the data was randomly split into training (75%) and testing (25%) sets. 

```{r train_test_split, warning = FALSE, message = FALSE}
calgary_for_model <- calgary_used %>%
  dplyr::select(all_of(c(all_predictors_used, target_used)), id)

train_ratio <- 0.75
sample <- sample.split(calgary_for_model$inundated, SplitRatio = train_ratio)
train <- subset(calgary_for_model, sample == TRUE)
test <- subset(calgary_for_model, sample == FALSE)
```


## Train a logistic model

The training set was used to train a logistic model using the `glm()` function, which specified the target variable ` inundated `and all other predictors. The predict function was then used to generate predicted probabilities for the `test` dataset, which was stored in the `predicted_probs` column.


```{r train_model, warning = FALSE, message = FALSE}
model1 <- glm(
  inundated ~ ., train %>% dplyr::select(-id) %>% st_drop_geometry(),
  family = "binomial"(link = "logit")
)

test$predicted_probs <- predict(model1, test, type = "response")
```

## Model summary
The summary output of the training logistic model provides the results of the logistic regression model. The summary shows which features are statistically significant in predicting the target variable in Calgary. In this case, the features `dem`, `slope`, `log_flow_accumulation`, `lag_dem`, `lag_log_dist_big_streams`, `lag_log_dist_huge_streams`, and `lag_impervious` were found to be statistically significant in predicting the target variable. This means that these features have a significant impact on the likelihood of an area being inundated.

```{r Model_summary, warning = FALSE, message = FALSE}
summary(model1)
```

After identifying the statistically significant features, maps of those features were plot to visualize the spatial distribution of Calgary.

```{r Feature_mapping1, warning = FALSE, message = FALSE}
# Create individual plots
plot1 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=dem)) +
  scale_fill_viridis() +
  labs(title="DEM in meters") + mapTheme

plot2 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=slope)) +
  scale_fill_viridis() +
  labs(title="Slope,percentage rise") + mapTheme

combined_plot <- plot_grid(plot1, plot2,ncol=2) 
print(combined_plot)      
```
```{r Feature_mapping2, warning = FALSE, message = FALSE}

plot3 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=log_flow_accumulation)) +
  scale_fill_viridis(name="") +
  labs(title="Log flow accumulation") + 
  mapTheme

plot4 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_big_streams)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag log", "\nDistance (meters) to streams with drainage > 50 km2") +
  mapTheme

combined_plot <- plot_grid(plot3,plot4,ncol=2)
print(combined_plot)  
```

```{r Feature_mapping3, warning = FALSE, message = FALSE}
plot5 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_log_dist_huge_streams)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag log", "\nDistance (meters) to streams with drainage > 100 km2") + mapTheme

plot6 <- ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=lag_impervious)) + 
  scale_fill_viridis(name="") +
  labs(title="Spatial lag","\nImpervious surface as percange of area") + mapTheme

combined_plot <- plot_grid(plot5, plot6,ncol=2)
print(combined_plot)  
```


# Model evaluation

## Probability density plots

The summary output only gives the overall AIC model performance and statistically significant features. However, the summary could not evaluate the error and accuracy. 

```{r Probability_plot, warning = FALSE, message = FALSE}
ggplot(test, aes(x = predicted_probs, fill = as.factor(inundated))) +
  geom_density() +
  facet_grid(inundated ~ ., scales = "free") +
  xlim(-0.1, 1) +
  labs(
    x = "Predicted Probability of Inundation",
    y = "Probability Density",
    title = "Distribution of predicted probabilities by observed outcome"
  )+
  scale_fill_manual(values = c("dark blue", "dark red"),
                    labels = c("Not Inundated","Inundated"),
                    name = "")+
  geom_vline(xintercept = .5)
```

To take a closer look, probability density plots were generated to show the distribution of predicted probabilities of inundation for the test set. The plots are arranged in a grid, with one plot for each value of the inundated variable. The blue plot represents areas that are not inundated, while the red plot represents areas that are inundated. The vertical line represents a 0.5 probability of inundation. If predicted probability is below 0.5, the predicted class is `0` (not inundated) and vice verses. 

From the plot, we can observe that the False Negative rate is rather high. This indicates that there are cases where the actual class is `1` (inundated), but the model incorrectly assigns a predicted class of `0` (not inundated) due to the predicted probability being below 0.5. This suggests that the model may not be accurately identifying areas that are likely to be inundated. 

Although the False Negative rate is high, the plot reveals a high True Positive rate. This means that the model is correctly predicting areas that are likely to be inundated, with a high probability. The high True Positive rate is encouraging, as it indicates that the model has some predictive power and can be useful in identifying areas that are vulnerable to flooding. However, it is still important to address the False Negative rate to improve the model's accuracy and reliability.

## ROC and AUC plots
The plot below shows the ROC curve and AUC for the model. The AUC value of 0.99 indicates that the model has a very high degree of accuracy in predicting whether an area will be inundated or not. However, it has potential to be overfitting with new datasets.

```{r roc_auc, warning = FALSE, message = FALSE}
roc_data <- data.frame(
  D = as.logical(test$inundated),
  M = test$predicted_probs
)

ggplot(roc_data, aes(d = D, m = M)) +
  geom_roc() +
  geom_abline(slope = 1, intercept = 0, linewidth = 1.5, color = "grey") +
  labs(
    title = "ROC Curve",
    subtitle = paste("Area Under Curve (AUC):", round(auc(pROC::roc(test$inundated, test$predicted_probs)), 4)),
    x = "False Positive Rate (FPR)",
    y = "True Positive Rate (TPR)"
  ) 
```

# Map prediction

## Train on entire Calgary
To generated predicted inundation, the entire Calgray dataset` calgary_for_model` was used to create a logistic regression model. where the target variable is inundated, and all other variables are predictors. After training the model, the `predict` function is used to generate predicted probabilities for the entire Calgary dataset. The predicted probabilities are stored in the `calgary_for_model$predicted_probs` column.

```{r model2, warning = FALSE, message = FALSE}
model2 <- glm(
  inundated ~ ., calgary_for_model %>% dplyr::select(-id) %>% st_drop_geometry(),
  family = "binomial"(link = "logit")
)

summary(model2)

calgary_for_model$predicted_probs <- predict(model2, calgary_for_model, type = "response")
```
## Confusion matrix map for Calgary
To map the predicted inundation with error and accuracy, a confusion matrix map was plotted. The mutate function is used to create a new column called `confResult`, which categorizes the predicted and actual classes into four possible outcomes: True Negative, True Positive, False Negative, and False Positive.


```{r Confusion_matrix, warning = FALSE, message = FALSE}
calgary_for_model %>%
  mutate(confResult=case_when(predicted_probs < 0.5 & inundated==0 ~ "True_Negative",
                              predicted_probs >= 0.5 & inundated==1 ~ "True_Positive",
                              predicted_probs < 0.5 & inundated==1 ~ "False_Negative",
                              predicted_probs >= 0.5 & inundated==0 ~ "False_Positive")) %>%
  ggplot()+
  geom_sf(aes(fill = confResult), color = "transparent")+
  scale_fill_manual(values = c("Red","Orange","Light Blue","Light Green"),
                    name="Outcomes")+
  labs(title="Confusion Metrics") +
  mapTheme
```


## Predicted inundation map for Calgary

Below is the predicted inundation map for Calgary trained using entire Calgary dataset.

```{r inundation_predicted, warning = FALSE, message = FALSE}
ggplot() +
  geom_sf(data=calgary_for_model, aes(fill=predicted_probs)) + 
  scale_fill_gradient(low = "blue", high = "red", name = "Predicted probability") +
  labs(title="Predicted inundation") + mapTheme
```

## Predicted inundation map Denver

Then the entire Calgary trained trained logistic regression model was used to predict inundation probabilities for the Denver dataset. The predict function is used to generate predicted probabilities for the Denver dataset using the model object created previously. The predicted probabilities are then stored in the `predicted_probs` column of the `denver_used` dataset and mapped using `ggplot`. This step allows us to apply the model to new data and evaluate how well it generalizes to other areas.

Though the predicted probabilities for Denver were comparatively small, the predicted class were distinguishable in 


```{r inundation_denver, warning = FALSE, message = FALSE}
denver_used$predicted_probs <- predict(model2, denver_used, type = "response")

ggplot() +
  geom_sf(data=denver_used, aes(fill=predicted_probs)) + 
  scale_fill_gradient(low = "blue", high = "red",name = "Predicted probability") +
  labs(title="Predicted inundation") + mapTheme
```


# Summary
This project aimed to estimate flood inundation probabilities using a predictive model for the cities of Calgary and Denver. Using logistic regression models, the prediction performance was evaluated using a confusion matrix and an ROC curve, which indicated an AUC of 0.99 and overall good performance.

However, the predicted probabilities for Denver were comparatively small, indicating a limitation in the model's ability to generalize to new areas. Furthermore, the AUC is unusually high which might indicate potential overfitting problems. For future improvement, one possible consideration is to validate the model using additional cities, which could make the model more rigorous and accurate in predicting flood inundation probabilities for new locations.